Abstract

In order to scale economically, data centers are increasingly evolving their data storage methods from the use 
of simple data replication to the use of more powerful erasure codes, which provide the same level of reliability 
as replication-based methods at a significantly lower storage cost. In particular, it is well known that Maximum- 
Distance-Separable (MDS) codes, such as Reed-Solomon codes, provide the maximum storage efficiency. While 
the use of codes for providing improved reliability in archival storage systems, where the data is less frequently 
accessed (or so-called "cold data"), is well understood, the role of codes in the storage of more frequently 
accessed and active "hot data" is less clear. In fact, a key question is: when the performance metric is no longer 
data reliability but rather latency, do codes even help? 

In this paper, we answer this question in the affirmative by studying coded data storage systems based on MDS 
codes through the lens of queueing theory, and term this the "MDS queue." We analytically characterize the latency 
performance of MDS queues, and reveal its superior performance (up to 70%) compared to that of currently used 
replication-based schemes. In our analysis of MDS queues, we present insightful scheduling policies that form 
upper and lower bounds to performance, and show that they are quite tight. Extensive simulations of the MDS 
queue using Markov-Chain-Monte-Carlo (MCMC) methods are also provided and used to validate our theoretical 
analysis. As a side note, the MDS-Reservation(t) queue underlying our lower-bound analysis is itself an elegant,
practical scheme: depending on the parameter t, it requires the maintenance of considerably smaller state
than the full-fledged MDS queue (which corresponds to t = ∞), and may be of independent
interest in practical systems.



I. Introduction 

Two of the primary objectives of a storage system are to provide reliability and availability of the stored data: 
the system must ensure that data is not lost even in the presence of individual component failures, and must be 
easily accessible to the user whenever required. The classical means of providing reliability is to employ the 
strategy of replication, wherein identical copies of the (entire) data are stored on multiple servers. However, this 
scheme is not very efficient in terms of the storage space utilization. The exponential growth in the amount of 
data being stored today makes storage an increasingly valuable resource, and has motivated data-centers today 
to increasingly turn to the use of more efficient storage codes [1]-[4].

Fig. 1 provides an illustrative example comparing the replication and coded schemes. Here, four files F_A, F_B,
F_C and F_D are to be stored in a reliable manner across four servers. The replication strategy, as depicted in
Fig. 1a, stores {F_A, F_B} in the first two servers and {F_C, F_D} in the two remaining servers. On the other hand, the
strategy of coding, as shown in Fig. 1b, partitions each file into two halves as F_x = [f_x1 f_x2] for x ∈ {A,B,C,D},
and stores the four sets {f_x1}, {f_x2}, {f_x1 + f_x2}, and {f_x1 + 2f_x2} in four different servers. Observe that the
total storage space required under this coding scheme is identical to that required under the replication scheme.
However, while the replication scheme loses files F_A and F_B upon failure of the first two servers, and loses
files F_C and F_D upon failure of the two other servers, the coding scheme loses no data even upon failure of



Fig. 1: An example comparing storage schemes based on (a) replication and (b) (MDS) codes. The data to be
stored consists of four files F_A, F_B, F_C and F_D. Each of these files is split into two halves in the coded system
as F_x = [f_x1 f_x2], for x ∈ {A,B,C,D}. Observe that in the coded system, any file can be recovered by reading
data from any two of the four servers.



any two of the four servers. Thus, the coding scheme in this example provides a higher reliability as compared 
to replication, for the same level of storage efficiency. While this was only a toy example, the amount of gains 
offered by codes can be significantly higher when the file is split into a larger number of chunks. Furthermore, 
any redundancy scheme based on replication must store at least one additional copy of the entire data, which 
necessitates a storage overhead of at least 100%. This may itself be prohibitive in many cases. On the other hand, 
codes do not face such a barrier and can support smaller overheads. 
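
The toy code of Fig. 1b can be checked mechanically. The following is a minimal sketch, assuming that each half of a file is a single integer and that arithmetic is over the rationals (the text does not fix the underlying field); the names CODE, encode and decode are hypothetical and introduced only for illustration. It stores f_x1, f_x2, f_x1 + f_x2 and f_x1 + 2f_x2 on four servers and recovers both halves from any two of them.

```python
from fractions import Fraction
from itertools import combinations

# Row (a, b) means that the server stores a*f1 + b*f2 (the four sets of Fig. 1b).
CODE = [(1, 0), (0, 1), (1, 1), (1, 2)]

def encode(f1, f2):
    """Return the four symbols stored on servers 0..3 for one file."""
    return [a * f1 + b * f2 for a, b in CODE]

def decode(i, j, yi, yj):
    """Recover (f1, f2) from the symbols yi, yj of two distinct servers i, j."""
    (a, b), (c, d) = CODE[i], CODE[j]
    det = a * d - b * c              # non-zero for every pair of rows of CODE
    f1 = Fraction(d * yi - b * yj, det)
    f2 = Fraction(a * yj - c * yi, det)
    return f1, f2

stored = encode(7, 3)
# Any two of the four servers suffice to recover the file.
recovered = [decode(i, j, stored[i], stored[j])
             for i, j in combinations(range(4), 2)]
```

Every one of the six server pairs yields the original halves, which is the "any k of n" property that the rest of the paper relies on.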

The most popular, and also the most storage-efficient, codes are a class known as Maximum-Distance-Separable
(MDS) codes [5]. The Reed-Solomon code [6] is an example of an MDS code, and so is the code
depicted in Fig. 1b. An MDS code is typically associated with two parameters n and k. Under an (n,k) MDS
code, a file is encoded and stored in n servers such that (a) the data stored in any k of these n servers suffice
to recover the entire file, and (b) the storage space required at each server is 1/k of the size of the original file.¹

As a result of the storage savings they offer, codes have been employed in data centers to store 'cold' data, 
i.e., data that is less frequently accessed, where reliability and storage efficiency are the primary metrics. While 
this application of codes is fairly well studied, the performance of codes for the storage of more frequently 
accessed "hot data" is less clear. In this context, a key question is: when the performance metric is no longer 
data reliability but rather latency, do codes even help? 

In this paper, we answer this question in the affirmative by studying coded data storage systems based on MDS 
codes through the lens of queueing theory, and term the queue that results from the use of MDS codes the MDS
queue. We analytically characterize the latency performance of MDS queues, and reveal their superior performance
(up to 70%) as compared to that of currently used replication-based schemes, under the models considered in 
this paper. In our analysis of MDS queues, we also present insightful scheduling policies that form upper and 
lower bounds to performance, and show that they are quite tight. Extensive Markov-Chain-Monte-Carlo (MCMC) 
simulations of the MDS queue are also provided and used to validate our theoretical analysis. 



A. Definition of the MDS queue 

We shall now describe the queueing theoretic model of a system employing an MDS code. To arrive at 
this model, we first make some simplifying assumptions in the interests of conceptual simplicity. We assume 
homogeneity among files, among requests, and among servers: the files are of identical size, incoming requests 

1 A more rigorous definition of an MDS code is that it is a code that satisfies the 'Singleton bound' [5].
Fig. 2: The percentage reduction in average latency of a coded system over a replication-based system, for a storage
system with parameters n = 10, k = 5 and service rate μ = 1. The curve titled 'MDS' is the reduction achieved by
the exact coded system. Also plotted are the gains achieved by the lower bounds (MDS-Reservation(t) queues)
and upper bounds (M^k/M/n(t) queues) presented in this paper. The graphs plotted are from MCMC simulations,
which closely match the analytical evaluations of the lower and upper bounds. The fall in the gains achieved by
some of the lower bounds at high arrival rates is because of their lower throughput, which makes the queues
associated with the bounds unstable at arrival rates close to their maximum throughput.



are distributed as a Poisson process independent of the state of the system, and every incoming request asks for 
reading out one particular file with the reading process at every server following an independent and identical 
exponential distribution.²

As discussed previously, under an MDS code, a file can be retrieved by downloading data from any k of the 
servers. We model this by treating each request for reading a file as a batch of k jobs. The k jobs of a batch 
represent the reads of k parts of the file from k servers. A batch is considered as served when all k of its jobs 



have completed service. For instance, a request for reading file F_A in the system depicted in Fig. 1b is treated
as a batch of two jobs. To service this request, the two jobs may be served by any two of the four servers; for
example, if the two jobs are served by servers 2 and 3, then they correspond to reading f_A2 and (f_A1 + f_A2)
respectively, which suffices to obtain the desired file F_A.

Definition 1 (MDS queue): An MDS queue is associated with four parameters (n,k) and [λ,μ].

• There are n identical servers 

• Requests enter into a (common) buffer of infinite capacity 

• Requests arrive in batches of k jobs each 

• These batches arrive as a Poisson process with a rate of λ

• Each of the k jobs in a batch can be served by an arbitrary set of k distinct servers 



2 While the service times in practice are unlikely to be distributed exponentially, such an assumption is meant to serve as a starting 
point for more rigorous analyses. 
• The service time for a job at any server is exponentially distributed with a rate of μ, and is independent of
the arrival and service times of all other jobs

• The jobs are processed in order, i.e., among all the waiting jobs that an idle server is allowed to serve, it 
serves the one which had arrived the earliest. 

The scheduling policy that governs this queue is formalized in Algorithm 1.

Algorithm 1 MDS scheduling policy
  on arrival of a batch
    assign as many jobs (of the new batch) as possible to idle servers
    append the remaining jobs (if any) as a new batch at the end of the buffer
  on departure from a server (say, server s)
    if ∃ at least one batch in the buffer such that no job of this batch has been served by s then
      among all such batches, find the batch that had arrived earliest
      assign a job from this batch to s
    end if



The following example illustrates the functioning of the MDS scheduling policy and the resultant MDS queue. 



Example 1: Consider the MDS(n=4,k=2) queue, as depicted in Fig. 3. Here, each request comes as a batch
of k = 2 jobs, and hence we denote each batch (e.g., A, B, C, etc.) as a pair of jobs (e.g., {A_1,A_2}, {B_1,B_2},
{C_1,C_2}, etc.). The two jobs in a batch need to be served by (any) two distinct servers. Denote the four servers
(from left to right) as servers 1, 2, 3 and 4. Suppose the system is in the state shown in Fig. 3a, wherein the
jobs A_2, A_1, B_1 and B_2 are being processed by the four servers, and there are three more batches waiting in
the buffer. Suppose server 1 completes servicing job A_2 (Fig. 3b). Now, this server is free to serve any of the 6
jobs waiting in the buffer. However, since we allow jobs to be processed only in order, server 1 begins servicing
job C_1 (assignment of C_2 instead would also have been valid). Next, suppose the first server completes service
of C_1 before any other servers complete their current tasks (Fig. 3c). In this case, since server 1 has already
served a job from the {C_1,C_2} batch, it is not allowed to service C_2 (due to the restriction in the 5th bullet of
the definition of the MDS queue). However, it can service any job from the next batch {D_1,D_2}, and one of
these two jobs is (arbitrarily) assigned to it. Finally, when one of the other servers completes its service, then
that server is assigned to service C_2 (Fig. 3d).
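
Algorithm 1 and Example 1 can be replayed with a small sketch. The class below is a hypothetical illustration (the name MDSScheduler, its methods and the 0-based server indices are assumptions, not part of the paper): it tracks only which batch each server is working on and, for every waiting batch, how many jobs are still in the buffer and which servers have already taken one of its jobs. Service times are not modelled; departures are triggered by hand.

```python
class MDSScheduler:
    """Bookkeeping for the MDS scheduling policy (Algorithm 1)."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.busy = {}      # server index -> label of the batch it is serving
        self.buffer = []    # waiting batches, earliest arrival first

    def arrive(self, label):
        idle = [s for s in range(self.n) if s not in self.busy]
        used = set(idle[:self.k])        # as many jobs as possible start now
        for s in used:
            self.busy[s] = label
        if len(used) < self.k:           # remaining jobs wait as one batch
            self.buffer.append({"label": label,
                                "waiting": self.k - len(used),
                                "used": used})

    def depart(self, s):
        """Server s finishes its job; return the batch it serves next, if any."""
        del self.busy[s]
        for batch in self.buffer:        # earliest-arrived eligible batch wins
            if s not in batch["used"]:   # s has not served this batch before
                batch["used"].add(s)
                batch["waiting"] -= 1
                self.busy[s] = batch["label"]
                if batch["waiting"] == 0:
                    self.buffer.remove(batch)
                return batch["label"]
        return None


# Replay of Example 1 with n = 4, k = 2 (the paper's server 1 is index 0 here).
q = MDSScheduler(4, 2)
for label in "ABCDE":
    q.arrive(label)                      # A and B start service; C, D, E wait
first = q.depart(0)                      # server 1 finishes a job of A
second = q.depart(0)                     # server 1 finishes its job of C
third = q.depart(1)                      # another server becomes free
```

In this replay the first server picks up a job of C, is then barred from the second job of C and moves on to D, and the next server to become free takes the remaining job of C, matching Fig. 3b to 3d.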



It turns out that a special case of the MDS queue, when k = 1, is equivalent to an M/M/n queue. The MDS
queue for a general value of k differs from the M/M/n queue in two ways. In the MDS(n,k) queue, 

• jobs arrive in batches of k, and 

• each job in a batch must be served by a different server. 

An exact analysis of the MDS queue is hard: a Markov chain representation to keep track of only the 
configuration of the buffer requires a state space that is infinite in k dimensions. Thus, in this paper, we provide 
scheduling policies for the MDS queue that lower/upper bound its exact performance. 




Fig. 3: Illustration of the functioning of the MDS queue, in panels (a) to (d). Requests take the form of batches A, B, C, etc., each
comprising k = 2 jobs {A_1,A_2}, {B_1,B_2}, {C_1,C_2}, etc. The two jobs of any batch can be served by any
two distinct servers.



B. Summary of results and contributions 

In this paper, we undertake a queueing theoretic study of the performance of coded systems with respect to 
the key metric of latency. We model the system as a queue, which we term the "MDS queue." 

We provide scheduling policies to bound the performance of the MDS queue. These scheduling policies are 
chosen so as to capture the most significant features of the MDS queue, while allowing analytical tractability. 
Analysing these scheduling policies in terms of several performance metrics, we provide tight bounds on the 
performance of the MDS queue. In addition, we also provide Markov-Chain-Monte-Carlo (MCMC) simulations 
for the MDS queue, which are used to validate our analytical results, and also target other metrics such as the 
tails of the distribution of the latency that are hard to evaluate otherwise. 

Our analysis and simulations reveal that codes can achieve significant gains over replication-based systems,
under the models considered in this paper. For example, the average latency incurred by a read request in a coded
system may be up to 70% lower than that under replication-based schemes,
as shown in Fig. 2; the 99th percentile tails of the latency may reduce by up to 50%, as shown subsequently in
the paper. The key insight is that the property of being able to recover the data from any k of the servers, which
lends MDS codes a high storage efficiency, also allows for greater parallel access, and is thus instrumental in
providing low latency. This paper quantifies these gains.

In addition to analysing the performance of coded systems with respect to several standard metrics, we also 
employ the framework presented in this paper to study additional topics of interest, such as comparing various 
methods of performing degraded reads, and understanding the effect of flooding requests. 

The lower bounds (the MDS-Reservation(t) scheduling policies) and the upper bounds (the M^k/M/n(t) scheduling
policies) are both indexed by a parameter 't'. An increase in the value of t results in tighter bounds, but
also increases the complexity of analysis. Furthermore, both classes of scheduling policies converge to the
MDS scheduling policy as t → ∞. We note that the MDS-Reservation(t) scheduling policies presented here are
themselves a practical alternative to the MDS scheduling policy, since they require maintenance of a smaller
state, while offering highly comparable performance. While the performance as well as the complexity of the
MDS-Reservation(t) queue increases with an increase in the value of the parameter t, our analysis shows that its
performance is very close to that of the MDS queue for very small values of t (as small as t = 3). Likewise, the
performance of the upper bounds closely follows that of the MDS queue for values of t as small as t = 1.
To allow the reader to evaluate performance of these queues under any parameters of choice, the source code
of the MCMC simulations as well as those for computing steady-state distributions and performance metrics of 
the proposed queues have been made freely downloadable from the websites of the first two authors. 

The remainder of the paper is organized as follows. Section II provides a background on this problem and
presents related literature. Section III describes the general approach and the notation followed in the paper.
Section IV presents the MDS-Reservation(t) queues, which lower bound the performance of the MDS queue.
Section V presents the M^k/M/n(t) queues, which upper bound the performance of the MDS queue. Section
VI presents an analysis and comparison of each of these queues, and of queues resulting from replication-based storage
schemes. The paper concludes with a discussion in Section VII. The appendix contains proofs of the theorems
presented in the paper.



II. Related Literature 
A. Latency and storage efficiency in data-centers 

1) Latency analysis: The study of latency comparisons between replication and coded systems was initiated
by Huang et al. in [7], which we build upon in this paper. They presented a scheduling policy termed the
'block-one-scheduling' policy, which provides a lower bound on the performance of the MDS queue. They analysed
this policy for k = 2 and showed that this scheme improves the average latency faced by a job by up to 17% as
compared to the replication scheme. The block-one-scheduling policy is a special case of the MDS-Reservation(t)
queues presented in this paper, and corresponds to the case when t = 1. While the block-one-scheduling policy
was analysed for k = 2 in [7], the analysis presented in this paper applies to any values of (n,k), and recovers the
corresponding results of [7] as a special case. In addition, the MDS-Reservation(t) queues of this paper, when
t ≥ 2, provide considerably tighter bounds on the performance of the MDS queue.

2) Blocking probability: The blocking probability of the MDS queue in the absence of a buffer was previously
studied in [8], for arbitrary service time distributions. In this paper, as a special case of the MDS queue analysis, 
we study the blocking probabilities for the case when there is no buffer as well as for the case when there is a 
buffer with a finite capacity. However, in this paper, we restrict our attention to the case when the service times 
are exponentially distributed. 

3) Request-flooding: One possible means of reducing latency is to send any request (redundantly) to more
servers than necessary, and to collect the results from the first of the servers to respond. Such 'request-flooding'
policies have often been suggested in the literature [9]-[11] for various applications. In particular, Joshi et al. [11]
consider a setting very similar to the one considered in this paper. They propose a request-flooding policy wherein
a batch would be sent to all n servers, and upon completion of any k of the n jobs, the remaining jobs would be
cancelled. They provide bounds on the average latency faced by a batch in the steady-state. However, flooding
requests to all n servers may be relatively resource expensive, and furthermore, the cancellation of jobs would
typically incur a certain cost (which is not accounted for in the models considered in [11]).

In this paper, we introduce an additional flooding parameter, p. In this new setting, a batch may be served by
up to p (≥ k) arbitrary servers, and the batch is considered as served whenever any k of these servers complete
service. In addition, we assume that the cancellation of a job at a server requires the server to remain idle for a
certain time, accounting for the cost incurred upon the cancellation. In addition to the rigorous analysis for p = k
that forms the bulk of this paper, we perform MCMC simulations to evaluate and compare the system performance
for various values of the flooding parameter p.
4) Degraded Reads: The coded system presented so far assumes that each incoming request desires one
complete file F_x, for some x ∈ {A,B,C,...}. However, in certain applications, some incoming requests may require
only a part of the file, say, f_xℓ for some ℓ ∈ {1,...,k}. Let us assume that the code employed is systematic: servers
1,2,...,k respectively store the data f_x1, f_x2, ..., f_xk. In this case, a request for f_xℓ can be served by reading f_xℓ
directly from the ℓ-th server. In the event that server ℓ is busy or unavailable, f_xℓ will have to be recovered from
the data stored in the remaining (n - 1) servers. Such an operation is called a 'degraded read'.

Under any MDS code, a degraded read may be performed by downloading the data stored in any k of the
remaining (n - 1) servers to obtain F_x, and then extracting the desired chunk f_xℓ from F_x. An alternative means
of performing degraded reads has been proposed recently in [12]-[14]. In particular, practical MDS codes have
been proposed in [14] (termed the 'product-matrix codes') that are associated with an additional parameter d,
and can recover f_xℓ of file F_x by reading and downloading small fractions of the data stored in any d of the
remaining (n - 1) servers. In this paper, we employ our queueing theoretic framework to analyse and compare the
latency performance of these two methods of performing degraded reads.



B. Diversity and error-correction 

The MDS queue also arises in other applications that require diversity or error correction. For instance, consider
a system with n processors, with jobs arriving as a Poisson process. It is often the case that the processors are
not completely reliable [15], and may give incorrect outputs at random times. In order to guarantee a correct
output, a job may be processed at k different servers, and the results aggregated (perhaps by a majority rule) to
obtain the final answer. Such a system results precisely in an MDS(n,k) queue. In general, queues where jobs
require diversity, for purposes such as security, error-protection etc., may be modelled as an MDS queue.³



C. Fork-join queues for parallel processing 



A class of queues closely related to the MDS queue is that of fork-join queues [16], [17]. A fork-join
queue is used to model systems wherein jobs require multiple resources that can be provided in parallel. These 
queues are similar to the MDS queue in the sense that in both queues, jobs arrive in batches of k, and each job 
needs to be served by one distinct server out of the n servers. However, the distinction between these two classes 
of queues is that under a fork-join queue, each job must be served by a particular pre-specified server, while 
under an MDS queue, the k servers serving the jobs can be chosen by the scheduling policy. It thus follows 
that the performance of an MDS(n,k) queue is lower bounded by that of an (n,k) fork-join queue, with the gap 
quantifying the gains due to the flexibility in scheduling. Furthermore, as a special case, the MDS(n,n) queue is
identical to the (n,n) fork-join queue.



III. Our Approach and Notation 

For each of the scheduling policies presented in this paper (for lower/upper bounding the MDS queue), we 
represent the respective resulting queues as continuous time Markov chains. We show that these Markov chains 
belong to a class of processes known as Quasi-Birth-Death (QBD) processes (described below), and obtain their 
steady-state distribution by exploiting the properties of QBD processes. 

3 An analogy that the academic will relate to is that of reviewing papers. There are n reviewers in total, and each paper must be reviewed
by k reviewers. This forms an MDS(n,k) queue. The values of λ and μ should be chosen such that λ/μ is close to the maximum throughput,
modelling the fact that reviewers are generally busy.
Throughout the paper, we shall refer to the entire setup described in Section I-A as the 'queue' or the 'system'.
We shall say that a batch is waiting (in the buffer) if at least one of its jobs is still waiting in the buffer (i.e.,
has not begun service). We shall use the term "i-th waiting batch" to refer to the batch that was the i-th earliest
to arrive, among all batches currently waiting in the buffer. For example, in the system in the state depicted in
Fig. 3a, there are three waiting batches: {C_1,C_2}, {D_1,D_2} and {E_1,E_2} are the first, second and third waiting
batches respectively.

Table I enumerates notation for various parameters that describe the system at any given time. To illustrate
this notation, consider again the system depicted in Fig. 3a. Here, the parameters listed in Table I take values
m = 10, z = 0, b = 3, s_1 = 0, s_2 = 0, s_3 = 0, w_1 = 2, w_2 = 2 and w_3 = 2. One can verify that keeping track of these
parameters leads to a valid Markov chain (under each of the scheduling policies discussed in this paper). Note
that we do not keep track of the jobs of a batch once all k jobs of that batch have begun to be served, nor
do we track which servers are serving which jobs. This is to ensure a smaller complexity of representation and
computation. Further note that in terms of the parameters listed in Table I, the number of servers that are busy at
any given time is equal to (n-z). For batch i in the buffer (i ∈ {1,...,b}), the number of jobs that have completed
service is equal to (k - s_i - w_i). For any integer i, w_i = 0 will mean that there is no i-th waiting batch in the buffer.
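
The relations among these parameters can be spelled out in a short sketch. The function below is hypothetical (its name and argument layout are assumptions made for illustration); it uses only the facts stated above, together with the observation that every busy server holds exactly one job, so that m equals the number of busy servers plus the number of jobs waiting in the buffer.

```python
def describe_state(n, k, busy_servers, waiting_batches):
    """Return (m, z, b, completed) for a snapshot of the system.

    busy_servers    : number of servers currently serving a job
    waiting_batches : list of (s_i, w_i) pairs, earliest waiting batch first
    completed       : per waiting batch, the number of jobs already finished
    """
    for s_i, w_i in waiting_batches:
        # A waiting batch has at least one job in the buffer, and its jobs in
        # the servers and in the buffer together never exceed k.
        assert 1 <= w_i <= k and 0 <= s_i <= k - 1 and s_i + w_i <= k
    z = n - busy_servers                     # idle servers
    b = len(waiting_batches)                 # waiting batches
    m = busy_servers + sum(w_i for _, w_i in waiting_batches)
    completed = [k - s_i - w_i for s_i, w_i in waiting_batches]
    return m, z, b, completed


# The state of Fig. 3a: all four servers busy, three untouched waiting batches.
state_3a = describe_state(n=4, k=2, busy_servers=4,
                          waiting_batches=[(0, 2), (0, 2), (0, 2)])
```

For the state of Fig. 3a this gives m = 10, z = 0 and b = 3, with no completed jobs in any waiting batch, in agreement with the values listed above.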

We shall frequently refer to an MDS queue as an MDS(n,k) queue, and assume [λ,μ] to be some fixed (known)
values. The system will always be assumed to begin in a state where there are no jobs in the system (i.e., with
m = b = 0). Since the arrival and service time distributions have valid probability density functions, we shall
assume that no two events occur at precisely the same time. We shall use the notation a^+ to denote max(a,0).

Review of Quasi-Birth-Death (QBD) processes: Consider a continuous-time Markov process on the states
{0,1,2,...}, with transition rate λ_0 from state 0 to 1, λ from state i to (i+1) for all i ≥ 1, μ_0 from state 1 to
0, and μ from state (i+1) to i for all i ≥ 1. This is a birth-death process. A QBD process is a generalization
of such a birth-death process, wherein each state i of the birth-death process is replaced by a set of states.
The first set of states (corresponding to i = 0 in the birth-death process) is called the set of boundary states,
whose behaviour is permitted to differ from that of the remaining states. The remaining sets of states are called
the levels, and the levels are identical to each other (recall that all states i ≥ 1 in the birth-death process are
identical). The Markov chain may have transitions only within a level or the boundary, between adjacent levels,
and between the boundary and the first level. The transition probability matrix of a QBD process is thus of the form

\[
\begin{bmatrix}
B_1 & B_2 & 0 & 0 & \cdots \\
B_0 & A_1 & A_2 & 0 & \cdots \\
0 & A_0 & A_1 & A_2 & \cdots \\
0 & 0 & A_0 & A_1 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \ddots
\end{bmatrix}. \tag{1}
\]



Value   Meaning                                                    Range
m       number of jobs in the entire system                        0 to ∞
z       number of idle servers                                     0 to n
b       number of waiting batches                                  0 to ∞
s_i     number of jobs of the i-th waiting batch, in the servers   0 to k-1
w_i     number of jobs of the i-th waiting batch, in the buffer    0 to k

TABLE I: Notation used to describe state of the system.
Here, the matrices B_0, B_1, B_2, A_0, A_1 and A_2 represent transitions entering the boundary from the first level,
within the boundary, exiting the boundary to the first level, entering a level from the next level, within a level,
and exiting a level to the next level respectively. If the number of boundary states is q_b and the number of
states in each level is q_ℓ, then the matrices B_0, B_1 and B_2 have dimensions (q_ℓ × q_b), (q_b × q_b) and (q_b × q_ℓ)
respectively, and each of A_0, A_1 and A_2 has dimensions (q_ℓ × q_ℓ). The birth-death process described above is a
special case with q_b = q_ℓ = 1 and B_0 = μ_0, B_1 = 0, B_2 = λ_0, A_0 = μ, A_1 = 0, A_2 = λ. Figures 5, 7 and 10 in the
sequel also present examples of QBD processes.

QBD processes are very well understood [18], and their stationary distribution is fairly easy to compute. In
this paper, we employ the SMCSolver software package [19] for this purpose. In the next two sections, we
present scheduling policies which lower and upper bound the performance of the MDS queue, and show that the
resulting queues can be represented as QBD processes. This property makes them easy to analyse, and this
is exploited subsequently in the analysis presented in Section VI.



IV. The MDS-Reservation(t) Queues: Lower Bounds 

We now present a class of scheduling policies (and the resulting queues), which we call the MDS-Reservation(t) 
scheduling policies (and MDS-Reservation(t) queues), whose performance lower bounds the performance of the 
MDS queue. We shall see subsequently that the MDS-Reservation(t) scheduling policies are practical policies that 
require maintenance of a much smaller state as compared to the MDS queue, and hence may be of independent 
interest. This class of scheduling policies is indexed by a parameter 't': a higher value of t leads to a better
performance and a tighter lower bound to the MDS queue, but on the downside, requires maintenance of a larger 
state and is also more complex to analyse. 

The MDS-Reservation(t) scheduling policy, in a nutshell, is as follows:

"apply the MDS scheduling policy, but with an additional restriction that for any i ∈ {t+1, t+2, ...}, the
i-th waiting batch is allowed to move forward in the buffer only when all k of its jobs can move forward
together."

We first describe in detail the special cases of t = 0 and t = 1, before moving on to the scheduling policy for a
general t. 



A. MDS-Reservation(0)

1) Scheduling policy: The MDS-Reservation(0) scheduling policy is rather simple: the batch at the head of the
buffer may start service only when k or more servers are idle. The policy is described formally in Algorithm 2.
Fig. 4: An illustration of the MDS-Reservation(0) scheduling policy for a system with parameters (n = 4, k = 2).
This policy prohibits the servers from processing jobs from a batch unless there are k idle servers that can process all k
jobs of that batch. As shown in the figure, server 1 is barred from processing {C_1,C_2} in (a), but is subsequently
allowed to do so when another server also becomes idle in (b).



Algorithm 2 MDS-Reservation(0) scheduling policy
  on arrival of a batch
    if number of idle servers < k then
      append new batch at end of buffer
    else
      assign k jobs of the batch to any k idle servers
    end if
  on departure from a server
    if (number of idle servers ≥ k) and (buffer is non-empty) then
      assign k jobs of the first waiting batch to any k idle servers
    end if



Example 2: Consider the MDS(n=4,k=2) queue in the state depicted in Fig. 3a. Suppose server 2 completes
processing job A_1 (Fig. 4a). Upon this event, the MDS scheduling policy would have allowed server 2 to take up
execution of either C_1 or C_2. However, this is not permitted under MDS-Reservation(0), and this server remains
idle until a total of at least k = 2 servers become idle. Now suppose the third server completes execution of
B_1 (Fig. 4b). At this point, there are sufficiently many idle servers to accommodate all k = 2 jobs of the batch
{C_1,C_2}, and hence jobs C_1 and C_2 are assigned to servers 2 and 3.
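
Algorithm 2 needs far less bookkeeping than Algorithm 1, which the following sketch tries to make concrete. It is a hypothetical illustration (class and method names are assumptions): since a waiting batch always has all k of its jobs in the buffer, it suffices to count idle servers and keep a FIFO list of waiting batch labels. Server identities and service times are not modelled.

```python
from collections import deque

class Reservation0Scheduler:
    """Bookkeeping for the MDS-Reservation(0) policy (Algorithm 2)."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.idle = n               # number of idle servers
        self.buffer = deque()       # labels of waiting batches, oldest first

    def arrive(self, label):
        """Return True if the batch starts service immediately."""
        if self.idle < self.k:
            self.buffer.append(label)
            return False
        self.idle -= self.k         # all k jobs enter service together
        return True

    def depart(self):
        """One job completes; return the batch that starts service, if any."""
        self.idle += 1
        if self.idle >= self.k and self.buffer:
            self.idle -= self.k
            return self.buffer.popleft()
        return None


# Replay of Example 2 with n = 4, k = 2, starting from the state of Fig. 3a.
r = Reservation0Scheduler(4, 2)
started = [r.arrive(label) for label in "ABCDE"]
after_first = r.depart()     # one server idles; batch C may not start yet
after_second = r.depart()    # a second idle server lets both jobs of C start
```

The first departure leaves a server idle even though jobs are waiting, which is exactly the inefficiency that makes this policy a lower bound on the MDS queue.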



We note that the MDS-Reservation(0) queue, when n = k, is identical to a split-merge queue [20].



2) Analysis: Observe that under the specific scheduling policy of MDS-Reservation(0), a batch that is waiting
in the buffer must necessarily have all its k jobs in the buffer, and furthermore, these k jobs go into the servers
at the same time.

We now describe the Markovian representation of the MDS-Reservation(0) queue. We show that it suffices to
keep track of only the total number of jobs m in the entire system.

Theorem 1: A Markovian representation of the MDS-Reservation(0) queue has a state space {0,1,...,∞},
and any state m ∈ {0,1,...,∞} has transitions to: (i) state (m+k) at rate λ, (ii) if m ≤ n then to state (m-1)
at rate mμ, and (iii) if m > n then to state (m-1) at rate (n - ((n-m) mod k))μ. The MDS-Reservation(0)
queue is thus a QBD process, with boundary states {0,1,...,n-k}, and levels m ∈ {n-k+1+jk,...,n+jk} for
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Fig. 5: State transition diagram of the MDS-Reservation(O) queue for n = 4 and k = 2. The notation at any state 
is the number of jobs m in the system in that state. The set of boundary states are {0,1,2}, and the levels are 
pairs of states {3,4}, {5,6}, {7,8}, etc. The transition matrix is of the form ([T} with Bo = [0 3/x ; 0], 

Bx = [-X A ; fi -(/x+A) 0; 2^ -(2/^+A)], B 2 = [0 0; A 0; A], A = [0 3/i ; 0], 
Ai = [-(3/x+A) 0; Afi -(4/x+A)], A 2 = [A 0; A]. 



j = {0,l,...,oo}. 

The state transition diagram of the MDS-Reservation(0) queue for (n = 4, k = 2) is depicted in Fig. 5.
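As an illustration, the transition rates of Theorem 1 can be assembled into a truncated generator matrix and sliced into QBD blocks. The following Python sketch is ours and not part of the original analysis; the function names `reservation0_generator` and `block`, and the truncation to a finite number of states, are assumptions made purely for illustration.

```python
def reservation0_generator(n, k, lam, mu, num_states):
    """Truncated generator of the Theorem 1 chain on states 0..num_states-1."""
    Q = [[0.0] * num_states for _ in range(num_states)]
    for m in range(num_states):
        if m + k < num_states:                 # batch arrival: m -> m + k
            Q[m][m + k] = lam
        if m > 0:                              # job departure: m -> m - 1
            busy = m if m <= n else n - ((n - m) % k)
            Q[m][m - 1] = busy * mu
        Q[m][m] = -sum(Q[m])                   # diagonal makes the row sum to zero
    return Q


def block(Q, rows, cols):
    """Sub-matrix of Q restricted to the given row and column states."""
    return [[Q[r][c] for c in cols] for r in rows]


# Example: n = 4, k = 2, boundary states {0, 1, 2}, levels {3, 4}, {5, 6}, ...
Q = reservation0_generator(n=4, k=2, lam=0.5, mu=1.0, num_states=9)
A0 = block(Q, rows=[5, 6], cols=[3, 4])   # one level down
A1 = block(Q, rows=[3, 4], cols=[3, 4])   # within a level
A2 = block(Q, rows=[3, 4], cols=[5, 6])   # one level up
```

Rows close to the truncation point lose their arrival transition, so only rows m with m + k below `num_states` should be compared against the blocks of the infinite chain.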

Theorem 1 shows that the MDS-Reservation(0) queue is a QBD process, allowing us to employ the SMC solver
to obtain its steady-state distribution. Alternatively, this queue is simple enough to analyse analytically as well.
Let y(m) denote the number of jobs being served when the Markov chain is in state m. From the description
above, this function can be written as:

\[
y(m) = \begin{cases}
m, & \text{if } 0 \le m \le n, \\
n - ((n-m) \bmod k), & \text{if } m > n.
\end{cases}
\]



Let π = [π_0 π_1 π_2 ···] denote the steady-state distribution of this chain. The global balance equation for the cut
between states (m-1) and m gives:

\[
\pi_m = \frac{\lambda}{y(m)\,\mu} \sum_{j=(m-k)^+}^{m-1} \pi_j \qquad \forall\, m > 0. \tag{2}
\]



Using these recurrence equations, for any given (n,k), the distribution 7r of the number of jobs in steady-state 
can be computed easily. 
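Recurrence (2) translates directly into a short numerical routine. The sketch below is our own illustration under stated assumptions: the chain is truncated at a hypothetical `max_jobs` states and then normalised, which is only meaningful when the arrival rate is below the maximum throughput of the queue so that the tail decays geometrically. The function names are ours.

```python
def served_jobs(m, n, k):
    """y(m): number of jobs in service when m jobs are in the system."""
    if m <= n:
        return m
    return n - ((n - m) % k)       # Python's % is non-negative for k > 0


def reservation0_distribution(n, k, lam, mu, max_jobs=2000):
    """Stationary distribution of MDS-Reservation(0) from recurrence (2)."""
    pi = [1.0]                      # unnormalised, pi_0 = 1
    for m in range(1, max_jobs + 1):
        upward_flow = lam * sum(pi[max(m - k, 0):m])
        pi.append(upward_flow / (served_jobs(m, n, k) * mu))
    total = sum(pi)
    return [p / total for p in pi]


pi = reservation0_distribution(n=4, k=2, lam=1.0, mu=1.0)
mean_jobs = sum(m * p for m, p in enumerate(pi))
```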



B. MDS-Reservation(1)

1) Scheduling policy: The MDS-Reservation(0) scheduling policy discussed above allows the batches in the
buffer to move ahead only when all k jobs in the batch can move together. The MDS-Reservation(1) scheduling
policy relaxes this restriction for (only) the batch at the head of the buffer. This is formalized in Algorithm 3.

Fig. 6: An illustration of the MDS-Reservation(1) scheduling policy, for a system with parameters (n = 4, k = 2).
As shown in the figure, this policy prohibits the servers from processing jobs of the second or later batches (e.g.,
{D_1, D_2} and {E_1, E_2} in (b)), until they move to the top of the buffer (e.g., {D_1, D_2} in (c)).



Algorithm 3 MDS-Reservation(1) Scheduling Policy
on arrival of a batch
    if buffer is empty then
        assign one job each from new batch to (at most k) idle servers
    end if
    append remaining jobs of batch to the end of the buffer
on departure from server (say, server s):
    if buffer is non-empty and no job from first waiting batch has been served by s then
        assign a job from first waiting batch to s
        if first waiting batch had only one job in buffer & there exists another waiting batch then
            to every remaining idle server, assign a job from second waiting batch
        end if
    end if



Example 3: Consider the MDS(n=4,k=2) queue in the state depicted in Fig. 3a. Suppose server 2 completes
processing job A_1 (Fig. 6a). Under MDS-Reservation(1), server 2 now begins service of job C_1 (which is allowed
by MDS, but was prohibited under MDS-Reservation(0)). Now suppose that server 2 finishes this service before
any other server (Fig. 6b). In this situation, since server 2 has already processed one job from batch {C_1, C_2},
it is not allowed to process C_2. However, there exists another batch {D_1, D_2} in the buffer such that none of
the jobs in this batch have been processed by the idle server 2. While the MDS scheduling policy would have
allowed server 2 to start processing D_1 or D_2, this is not permitted under MDS-Reservation(1), and the second
server remains idle. Now, if server 3 completes service (Fig. 6c), then C_2 is assigned to server 3, allowing batch
{D_1, D_2} to move up as the first batch. This now permits server 2 to begin service of job D_1.

The MDS-Reservation(1) scheduling policy is identical to the block-one-scheduling policy proposed in [7].
While this scheme was analysed in [7] only for the case of k = 2, in this paper, we present a more general
analysis that holds for all values of the parameter k.

2) Analysis: The following theorem describes the Markovian representation of the MDS-Reservation(1) queue.
Each state in this representation is defined by two quantities: (i) the total number of jobs m in the system, and
(ii) the number of jobs w_1 of the first waiting batch that are still in the buffer.

Fig. 7: State transition diagram of the MDS-Reservation(1) queue for n = 4 and k = 2. The notation at any
state is (w_1, m). The subset of states that are never visited is not shown. The set of boundary states is
{0,1,2,3,4} × {0,1,2}, and the levels are sets {5,6} × {0,1,2}, {7,8} × {0,1,2}, etc.

Theorem 2: The Markovian representation of the MDS-Reservation(1) queue has a state space {0,1,...,k} ×
{0,1,...,∞}. It is a QBD process with boundary states {0,...,k} × {0,...,n}, and levels {0,...,k} × {n-k+1+jk,
...,n+jk} for j = {1,2,...,∞}.

The state transition diagram of the MDS-Reservation(1) queue for (n = 4, k = 2) is depicted in Fig. 7.

Note that the state space {0,1,...,k} × {0,1,...,∞} has several states that will never be visited during the
execution of the Markov chain. For instance, the states (w_1 > 0, m ≤ n-k) never occur. This is because w_1 > 0
implies existence of some job waiting in the buffer, while m ≤ n-k implies that k or more servers are idle. The
latter condition implies that there exists at least one idle server that can process a job from the first waiting batch,
and hence the value of w_1 must be smaller than that associated to that state, thus proving the impossibility of
the system being in that state.

C. MDS-Reservation(t) for a general t

1) Scheduling policy: Algorithm 4 formally describes the MDS-Reservation(t) scheduling policy.

Algorithm 4 MDS-Reservation(t)
on arrival of a batch:
    if there are fewer than t batches in the buffer then
        assign 1 job each from new batch to every idle server
    end if
    append remaining jobs of batch to the end of the buffer
on departure from server (say, server s):
    find î = min{i ≥ 1 : no job of the i-th waiting batch has been served by s}
    if î exists & î ≤ t then
        assign a job of the î-th waiting batch to s
        if î = 1 & the first waiting batch had only one job in buffer & there exists another waiting batch then
            to every remaining idle server, assign a job from second waiting batch
        end if
    end if



The following example illustrates the MDS-Reservation(t) scheduling policy when t = 2.

Fig. 8: An illustration of the working of the MDS-Reservation(2) scheduling policy, for a system with parameters
(n = 4, k = 2). As shown in the figure, this policy prohibits the servers from processing jobs of the third and later
batches (e.g., batch {E_1, E_2} in (c)), until they move higher in the buffer (e.g., as in (d)).



Example 4: (t = 2). Consider the MDS(n=4,k=2) queue in the state depicted in Fig. 3a. Suppose the second
server completes processing job A_1 (Fig. 8a). Under the MDS-Reservation(2) scheduling policy, server 2 now
begins service of job C_1. Now suppose that server 2 finishes this service as well, before any other server completes
its respective service (Fig. 8b). In this situation, while MDS-Reservation(1) would have mandated server 2 to
remain idle, MDS-Reservation(2) allows it to start processing a job from the next batch {D_1, D_2}. However, if
the server also completes processing of this job before any other server (Fig. 8c), then it is not allowed to take up
a job of the third batch {E_1, E_2}. Now suppose server 3 completes service (Fig. 8d). Server 3 can begin serving
job C_2, thus clearing batch {C_1, C_2} from the buffer, and moving the two remaining batches up in the buffer.
Batch {E_1, E_2} is now within the threshold of t = 2, allowing it to be served by the idle server 2.

2) Analysis: 

Theorem 3: The Markovian representation of the MDS-Reservation(t) queue has a state space {0,1,...,k}^t ×
{0,1,...,∞}. It is a QBD process with boundary states {0,...,k}^t × {0,...,n-k+tk}, and levels {0,...,k}^t ×
{n-k+1+jk,...,n+jk} for j = {t,t+1,...,∞}.

One can see that the sequence of MDS-Reservation(t) queues, as t increases, becomes closer to the MDS 
queue. This results in tighter bounds, and also increased complexity of the transition diagrams. The limit of this 
sequence is the MDS queue itself. 



Theorem 4: The MDS-Reservation(t) queue, when t = oo, is precisely the MDS queue. 



V. The M^k/M/n(t) Queues: Upper Bounds

In this section, we present a class of scheduling policies (and resulting queues), which we call the M^k/M/n(t)
scheduling policies (and M^k/M/n(t) queues), whose performance upper bounds the performance of the MDS
queue. The scheduling policies presented here relax the constraint of requiring the k jobs in a batch to be
processed by k distinct servers. The M^k/M/n(t) scheduling policies and the M^k/M/n(t) queues are not
realizable in practice; they are presented here only to obtain upper bounds to the performance of the MDS queue.

The M^k/M/n(t) scheduling policy, in a nutshell, is as follows:

"apply the MDS scheduling policy whenever there are t or fewer batches in the buffer; when there are more
than t batches in the buffer, ignore the restriction requiring the k jobs of a batch to be processed by distinct
servers."

Fig. 9: Illustration of the working of the M^k/M/n(0) scheduling policy. This policy allows a server to process
more than one job of the same batch. As shown in the figure, server 1 processes both C_1 and C_2.

We first describe the M^k/M/n(0) queue in detail, before moving on to the general M^k/M/n(t) queues.



A. M^k/M/n(0)

1) Scheduling policy: The M^k/M/n(0) scheduling policy operates by completely ignoring the restriction of
assigning distinct servers to jobs of the same batch. This is described formally in Algorithm 5.

Algorithm 5 M^k/M/n(0)
on arrival of a batch
    assign jobs of this batch to idle servers (if any)
    append remaining jobs at the end of the buffer
on departure from a server
    if buffer is not empty then
        assign a job from the first waiting batch to this server
    end if



Note that the M^k/M/n(0) queue is identical to the M^k/M/n queue, i.e., an M/M/n queue with batch arrivals.
The following example illustrates the working of the M^k/M/n(0) scheduling policy.

Example 5: Consider the MDS(n=4,k=2) queue in the state depicted in Fig. 3a. Suppose the first server
completes processing job A_2, as shown in Fig. 9a. Under the M^k/M/n(0) scheduling policy, server 1 now takes
up job C_1. Next suppose server 1 also finishes this task before any other server completes service (Fig. 9b). In
this situation, the MDS scheduling policy would prohibit job C_2 from being served by the first server. However, under
the scheduling policy of M^k/M/n(0), we relax this restriction, and permit server 1 to begin processing C_2.

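2) Analysis: We now describe a Markovian representation of the M^k/M/n(0) scheduling policy, and show that
it suffices to keep track of only the total number of jobs m in the system.

Fig. 10: State transition diagram of the M^k/M/n(0) queue for n = 4 and k = 2. The notation at any state is the
number of jobs m in the system in that state. The set of boundary states is {0,1,2,3,4}, and the levels are pairs
of states {5,6}, {7,8}, etc. The transition matrix is of the form (1) with

\[
B_0 = \begin{bmatrix} 0 & 0 & 0 & 0 & 4\mu \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \quad
B_1 = \begin{bmatrix}
-\lambda & 0 & \lambda & 0 & 0 \\
\mu & -(\mu+\lambda) & 0 & \lambda & 0 \\
0 & 2\mu & -(2\mu+\lambda) & 0 & \lambda \\
0 & 0 & 3\mu & -(3\mu+\lambda) & 0 \\
0 & 0 & 0 & 4\mu & -(4\mu+\lambda)
\end{bmatrix}, \quad
B_2 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \lambda & 0 \\ 0 & \lambda \end{bmatrix},
\]
\[
A_0 = \begin{bmatrix} 0 & 4\mu \\ 0 & 0 \end{bmatrix}, \quad
A_1 = \begin{bmatrix} -(4\mu+\lambda) & 0 \\ 4\mu & -(4\mu+\lambda) \end{bmatrix}, \quad
A_2 = \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix}.
\]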

Theorem 5: The Markovian representation of the M^k/M/n(0) queue has a state space {0,1,...,∞}, and any
state m ∈ {0,1,...,∞} has transitions (i) to state (m+k) at rate λ, and (ii) if m > 0, then to state (m-1) at rate
min(n,m)μ. It is a QBD process with boundary states {0,...,n}, and levels {n-k+1+jk,...,n+jk} for
j = {1,2,...,∞}.

The state transition diagram of the M^k/M/n(0) queue for n = 4, k = 2 is shown in Fig. 10.

Theorem 5 shows that the M^k/M/n(0) queue is a QBD process, allowing us to employ the SMC
solver to obtain its steady-state distribution. Alternatively, this queue is simple enough to analyse analytically as
well. Let π_m denote the stationary probability of any state m ∈ {0,1,...,∞}. Then, for any m ∈ {1,...,∞}, the
global balance equation for the cut between states (m-1) and m gives:

\[
\pi_m = \frac{\lambda}{\min(m,n)\,\mu} \sum_{j=(m-k)^+}^{m-1} \pi_j. \tag{3}
\]



The stationary distribution of the Markov chain can now be computed easily from these equations. 
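As a worked instance (our own illustration, not taken from the original text), consider n = 4 and k = 2. Equation (3) then unrolls as follows:

```latex
\begin{align*}
\pi_1 &= \frac{\lambda}{\mu}\,\pi_0, \\
\pi_2 &= \frac{\lambda}{2\mu}\,(\pi_0 + \pi_1), \\
\pi_3 &= \frac{\lambda}{3\mu}\,(\pi_1 + \pi_2), \\
\pi_m &= \frac{\lambda}{4\mu}\,(\pi_{m-2} + \pi_{m-1}), \qquad m \ge 4 .
\end{align*}
```

For m ≥ 4 this is a linear recurrence with constant coefficient r = λ/(4μ), whose dominant solution decays geometrically with ratio x = (r + sqrt(r^2 + 4r))/2. This ratio is below 1 exactly when r < 1/2, that is, when λ < 2μ = (n/k)μ, which agrees with the maximum throughput stated for this class of queues in Theorem 8.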



B. M^k/M/n(t) for a general t

1) Scheduling policy: Algorithm 6 formally describes the M^k/M/n(t) scheduling policy.

Fig. 11: Illustration of the working of the M^k/M/n(1) scheduling policy. This policy allows a server to begin
processing a job of a batch that it has already served, unless this batch is the only batch waiting in the buffer.
As shown in the figure, server 1 cannot process C_2 in (b) since it has already processed C_1 and C is the only
waiting batch; this restriction is removed upon arrival of another batch in the buffer in (d).



Algorithm 6 M^k/M/n(t)
on arrival of a batch
    if buffer has strictly fewer than t batches then
        assign jobs of new batch to idle servers
    else if buffer has t batches then
        assign jobs of first batch to idle servers
        if first batch is cleared then
            assign jobs of new batch to idle servers
        end if
    end if
    append remaining jobs of the new batch to the end of the buffer
on departure of job from a server (say, server s)
    if number of batches in buffer is strictly greater than t then
        assign job from first batch in buffer to this server
    else
        among all batches in the buffer that have not been served by s, find the one that arrived earliest
        assign a job of this batch to s
    end if



Example 6: (t = 1). Consider a system in the state shown in Fig. 11a. Suppose server 1 completes execution of
job C_1 (Fig. 11b). In this situation, the processing of C_2 by server 1 would be allowed under M^k/M/n(0), but
prohibited in the MDS queue. The M^k/M/n(1) queue follows the scheduling policy of the MDS queue whenever
the total number of batches in the buffer is no more than 1, and hence in this case, server 1 remains idle. Next,
suppose there is an arrival of a new batch (Fig. 11c). At this point there are two batches in the buffer, and the
M^k/M/n(1) scheduling policy switches its mode of operation to allowing any server to serve any job. Thus, the
first server now begins service of C_2 (Fig. 11d).



2) Analysis:

Theorem 6: The Markovian representation of the M^k/M/n(t) queue has a state space {0,1,...,k}^t × {0,1,2,...}. It
is a QBD process with boundary states {0,...,k}^t × {0,...,n+tk}, and levels {0,...,k}^t × {n-k+1+jk,...,n+jk}
for j = {t+1,t+2,...,∞}.

Fig. 12: An example of the Replication-II storage scheme. Here, n = 4 and k = 2. Contrast this to the Replication-I
storage scheme depicted in Fig. 1a for the same parameters.

As in the case of MDS-Reservation(t) queues, one can see that the sequence of M^k/M/n(t) queues, as t increases,
becomes closer to the MDS queue. As the parameter t increases, the bounds become tighter, but the
complexity of the transition diagrams also increases, and the limit of this sequence is the MDS queue itself.

Theorem 7: The M^k/M/n(t) queue, when t = ∞, is precisely the MDS(n,k) queue.

Remark 1: The class of queues presented in this section has another interesting connection with
the MDS queue: the performance of an MDS(n,k) queue is lower bounded by the M^k/M/(n-k+1)(t) queue for
any value of t.



VI. Performance Analysis and Comparison of Various Queues 

In this section we analyse the performance of the MDS-Reservation(t) queues and the M^k/M/n(t) queues using
the properties of QBD processes. These form lower and upper bounds respectively to the performance of the
MDS queue. We also provide a performance analysis of the MDS queue via MCMC simulations. We compare 
the performance of these queues with replication-based schemes. 
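To make the simulation side concrete, the following is a minimal discrete-event sketch of an MDS(n,k) queue with Poisson batch arrivals and exponential service. It is our own illustration and may differ from the simulation procedure used for the figures. It assumes the MDS scheduling policy as suggested by the examples above: an arriving batch places one job on each idle server (at most k), and a departing server picks up a job of the earliest waiting batch that it has not yet served. All names in the code are hypothetical.

```python
import heapq
import random


def simulate_mds_queue(n, k, lam, mu, num_batches, seed=0):
    """Simulate an MDS(n, k) queue; return latencies of completed batches."""
    rng = random.Random(seed)
    events = [(rng.expovariate(lam), 0, "arrival", None)]  # (time, tie, kind, server)
    counter = 1                 # unique tie-breaker, so kinds are never compared
    idle = set(range(n))
    serving = {}                # server -> batch currently in service
    waiting = {}                # batch -> jobs still in buffer (arrival order)
    touched = {}                # batch -> servers that took one of its jobs
    finished = {}               # batch -> number of completed jobs
    arrived_at = {}
    latencies = []

    def start(server, batch, now):
        nonlocal counter
        idle.discard(server)
        serving[server] = batch
        touched[batch].add(server)
        heapq.heappush(events, (now + rng.expovariate(mu), counter, "departure", server))
        counter += 1

    next_batch = 0
    while len(latencies) < num_batches:
        now, _, kind, server = heapq.heappop(events)
        if kind == "arrival":
            batch, next_batch = next_batch, next_batch + 1
            arrived_at[batch], touched[batch], finished[batch] = now, set(), 0
            free = sorted(idle)[:k]             # at most k idle servers, one job each
            for s in free:
                start(s, batch, now)
            if len(free) < k:
                waiting[batch] = k - len(free)
            heapq.heappush(events, (now + rng.expovariate(lam), counter, "arrival", None))
            counter += 1
        else:
            batch = serving.pop(server)
            idle.add(server)
            finished[batch] += 1
            if finished[batch] == k:
                latencies.append(now - arrived_at[batch])
            # Earliest waiting batch of which this server has not served a job.
            target = next((b for b in waiting if server not in touched[b]), None)
            if target is not None:
                waiting[target] -= 1
                if waiting[target] == 0:
                    del waiting[target]
                start(server, target, now)
    return latencies


latencies = simulate_mds_queue(n=4, k=2, lam=1.0, mu=1.0, num_batches=5000, seed=1)
average_batch_latency = sum(latencies) / len(latencies)
```

The sketch starts from an empty system, so early samples include a transient; discarding an initial portion of the run would give a closer estimate of steady-state behaviour.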

A replication-based scheme stores multiple copies of the entire data, and hence must have n as a multiple of
k. As in systems employing MDS codes, we assume that under a replication-based scheme, each server has a
capacity of storing 1/k of the total data. Replication can be of two types:



• Replication-I (depicted in Fig. 1a): The n servers are partitioned into k sets of n/k servers each, and the set
of files is partitioned into k sets of equal size. Each set of files is associated to one set of servers, and is
replicated in the n/k servers in this set. Thus, a request for reading a file can be completed by any of the n/k
servers in the set of servers to which the file belongs.

• Replication-II (depicted in Fig. 12): The n servers are partitioned into k sets of n/k servers each, and each
individual file is split into k chunks of equal size. For each file, the i-th chunk (1 ≤ i ≤ k) is replicated in the
n/k servers of the i-th set of servers. Thus, a request for reading a file is split into k identical jobs, and the i-th
job can be served by any of the n/k servers in the i-th set of servers.

In our study, we observed that Replication-I consistently performs worse than Replication-II, under the models
considered here. Thus, in the analysis to follow, we plot only the performance of the Replication-II scheme, and
omit Replication-I. Furthermore, in the analysis of the latency incurred by coded systems, we assume that the
additional delay incurred due to computations required for decoding the data is negligible. This assumption is
justified in most systems of interest, since the codes employed typically have small block lengths, allowing for
fast decoding.

The following subsections present the analysis and comparisons of the queues with respect to various metrics.
Unless mentioned otherwise, the values of the system parameters for the graphs plotted are

n = 10, k = 5, and μ = 1.

The system is assumed to be in steady-state. The curves marked '[MCMC]' have been obtained via MCMC
simulations, while the rest have been obtained analytically.



A. Maximum throughput 

The maximum throughput is the maximum possible number of requests that can be served by the system per 
unit time. 

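Theorem 8: For any given (n,k) and t ≥ 1, let λ*_Resv(t), λ*_MDS, and λ*_{M^k/M/n(t)} denote the maximum
throughputs of the MDS-Reservation(t), MDS, and M^k/M/n(t) queues respectively. Then,

\[
\lambda^*_{\mathrm{MDS}} = \lambda^*_{M^k/M/n(t)} = \frac{n}{k}\,\mu. \tag{4}
\]

When k is treated as a constant,

\[
\left(1 - O(n^{-2})\right) \frac{n}{k}\,\mu \;\le\; \lambda^*_{\mathrm{Resv}(t)} \;\le\; \frac{n}{k}\,\mu. \tag{5}
\]

In particular, when k = 2,

\[
\left(1 - \frac{1}{2n^2 - 2n + 1}\right) \frac{n}{k}\,\mu \;\le\; \lambda^*_{\mathrm{Resv}(1)} \;\le\; \lambda^*_{\mathrm{Resv}(t)}, \tag{6}
\]

and when k = 3,

\[
\left(1 - \frac{4n^3 - 8n^2 + 2n + 4}{3n^5 - 12n^4 + 22n^3 - 29n^2 + 26n - 8}\right) \frac{n}{k}\,\mu \;\le\; \lambda^*_{\mathrm{Resv}(1)} \;\le\; \lambda^*_{\mathrm{Resv}(t)}. \tag{7}
\]

The maximum throughput of Replication-I and Replication-II is also (n/k)μ.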

Note that the special case of Theorem 8 for the MDS-Reservation(1) queue with k = 2 also recovers the
throughput results of [7]. Moreover, as compared to the proof techniques presented in [7], the proofs in this
paper are simpler and do not require computation of the stationary distribution. Using the techniques presented
in the proof of Theorem 8, bounds analogous to (6) and (7) can be computed for k ≥ 4 as well.
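To give a feel for the size of the throughput loss, the closed forms of Theorem 8 can be evaluated numerically. The sketch below is ours; the function names are hypothetical, and the polynomial coefficients for k = 3 are taken as we read them from the theorem statement.

```python
def max_throughput_mds(n, k, mu):
    """Equation (4): maximum throughput of the MDS and M^k/M/n(t) queues."""
    return n * mu / k


def reservation1_lower_bound(n, k, mu):
    """Lower bounds (6) and (7) on the MDS-Reservation(1) maximum throughput."""
    if k == 2:
        loss = 1 / (2 * n**2 - 2 * n + 1)
    elif k == 3:
        loss = (4 * n**3 - 8 * n**2 + 2 * n + 4) / (
            3 * n**5 - 12 * n**4 + 22 * n**3 - 29 * n**2 + 26 * n - 8)
    else:
        raise ValueError("closed forms are stated only for k = 2 and k = 3")
    return (1 - loss) * n * mu / k


for k in (2, 3):
    bound = reservation1_lower_bound(n=10, k=k, mu=1.0)
    best = max_throughput_mds(n=10, k=k, mu=1.0)
    print(f"k={k}: fraction of MDS throughput retained >= {bound / best:.4f}")
```

For n = 10, both fractions stay within a few percent of 1, in line with the O(n^-2) loss in (5).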



Fig. 13 plots the loss in maximum throughput incurred by the MDS-Reservation(1) and the MDS-Reservation(2)
queues, as compared to that of the MDS queue (and replication).



B. System occupancy 

The system occupancy at any given time is the number of jobs present in the system at that time. This includes
jobs that are waiting in the buffer as well as the jobs being processed in the servers at that time. Fig. 14 plots
the complementary cdf of the number of jobs in the system in the steady state. Observe that the analytical
upper and lower bounds of the MDS-Reservation(2) and the M^k/M/n(1) queues respectively are extremely close
to each other.

Fig. 13: Loss in maximum throughput incurred by the MDS-Reservation(1) and the MDS-Reservation(2)
queues as compared to that of the MDS queue (and replication).

Fig. 14: Steady state distribution (complementary cdf) of the system occupancy.

Fig. 15: Average latency faced by a job.

Fig. 16: Average latency faced by a batch. [Plot: latency versus arrival rate λ; curves: MDS-Reservation(1),
MDS-Reservation(3), Replication-II [MCMC], M^k/M/n(1), MDS [MCMC].]



C. Average job latency 

The latency faced by a job is the time from its arrival into the system till the time it completes being serviced
at a server. Fig. 15 plots the average latency faced by a job in the steady state. Here, the average job latencies
of the MDS-Reservation(t) and the M^k/M/n(t) queues have been obtained analytically by finding the stationary
distribution of the respective QBD processes, and applying Little's law. The average job latency of the MDS
queue is plotted via an MCMC simulation.
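The Little's law step can be sketched as follows. This is our own illustration: `pi` is assumed to be a stationary distribution indexed by the total number of jobs m in the system (for queues with a richer state, it would first have to be marginalised over the other state components), and the job arrival rate is kλ because every batch brings k jobs.

```python
def mean_job_latency(pi, k, lam):
    """Average job latency from the occupancy distribution via Little's law."""
    mean_jobs = sum(m * p for m, p in enumerate(pi))
    return mean_jobs / (k * lam)        # jobs arrive at rate k * lam


# Sanity check on an M/M/1 queue (k = 1), where pi_m = (1 - rho) * rho**m.
rho, lam = 0.5, 0.5
geometric = [(1 - rho) * rho**m for m in range(200)]
latency = mean_job_latency(geometric, k=1, lam=lam)
```

In the M/M/1 check, the result should be close to the textbook value 1/(μ - λ) with μ = 1.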
D. Batch latency

The latency faced by a batch is the time from its arrival into the system till the last of its k jobs completes
service.

1) Average: We have analytically computed the average batch latencies of the MDS-Reservation(t) and the
M^k/M/n(t) queues in the following manner. We first compute the steady state distribution π of the corresponding
QBD process. As discussed in the previous sections, each state i is associated to a unique configuration of the
jobs in the system. Now, the average latency d_i faced by a batch entering when the system is in state i can be
computed easily as a dynamic program, by employing the Markovian representations of the queues presented
previously. Since Poisson arrivals see time averages [21], the average latency faced by a batch in the steady state
is given by Σ_i π_i d_i.

Fig. 16 plots the average latency faced by a batch in the steady state. Observe that the coded systems achieve
up to 70% reduction in average latency as compared to replication. Also observe that the performance of the
MDS-Reservation(t) scheduling policy, for t as small as 3, is extremely close to that of the MDS scheduling
policy and to the upper bounding M^k/M/n(1) scheduling policy.

2) Tails: It is of considerable practical interest to analyse the tail of the latency distribution. Here, we perform
this analysis via MCMC simulations. Fig. 17 plots the 99th percentile of the distribution of the latency faced by
a batch: for a given value of the arrival rate λ, a curve takes value y_λ if exactly 1% of the batches incur a delay
greater than y_λ. Again, we see that coded systems achieve significant gains over replication.
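Given a list of simulated batch latencies, a tail percentile can be estimated as sketched below. This is our own illustration using the nearest-rank convention, which is one of several common definitions of an empirical percentile; the function name and its integer `percent` argument are assumptions.

```python
def tail_percentile(latencies, percent=99):
    """Nearest-rank percentile: smallest sample with at least percent% at or below it."""
    ordered = sorted(latencies)
    rank = (percent * len(ordered) + 99) // 100     # integer ceiling, 1-based rank
    return ordered[max(rank, 1) - 1]


samples = [float(x) for x in range(1, 101)]          # stand-in for simulated latencies
y = tail_percentile(samples, percent=99)
fraction_above = sum(1 for x in samples if x > y) / len(samples)
```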



E. Waiting probability 

The waiting probability is the probability that, in the steady state, one or more jobs of an arriving batch will
have to wait in the buffer and will not be served immediately. Fig. 18 plots the waiting probability for the different
queues considered in the paper. Observe how tightly the M^k/M/n(1) and the MDS-Reservation(2) queues bound
the waiting probability of the MDS queue. Also observe that replication fares much worse than codes, even for
low arrival rates.
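For the two queues whose state is just the number of jobs m (MDS-Reservation(0) and M^k/M/n(0)), the waiting probability follows directly from the stationary distribution: an arriving batch has to wait, at least in part, exactly when fewer than k servers are idle, that is, when m > n - k. The sketch below is our own illustration of this observation and relies on Poisson arrivals seeing time averages; it does not cover the queues with t ≥ 1, whose states carry more information.

```python
def waiting_probability(pi, n, k):
    """P(arriving batch finds fewer than k idle servers) = P(m > n - k)."""
    return sum(p for m, p in enumerate(pi) if m > n - k)


# Sanity check on an M/M/1 queue (n = k = 1): the probability of waiting is rho.
rho = 0.5
geometric = [(1 - rho) * rho**m for m in range(200)]
p_wait = waiting_probability(geometric, n=1, k=1)
```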
F. Blocking Probability

We have assumed throughout the paper that the buffer has an infinite capacity. In this section, let us suppose
that the buffer has a finite capacity, accommodating at most a certain number of batches. Then, the blocking
probability is the probability that (in the steady state) an arriving batch will find the buffer full and hence will
have to be rejected. Fig. 19 plots the blocking probability of a system for various buffer sizes. Observe that the
blocking probability of the coded system is consistently smaller than that of replication.



G. Degraded Reads 

The coded system presented so far assumes that each incoming request desires one complete file F_x, for
some x ∈ {A,B,C,...}. As discussed previously in Section II-A4, in certain applications, incoming requests may
sometimes require only a part of the file f_{x,ℓ}, for some ℓ ∈ {1,...,k}. In this case, assuming the code is systematic,
f_{x,ℓ} can be retrieved by reading it directly from the ℓ-th server. However, in the situation that server ℓ is busy, one
may need to perform a 'degraded read', i.e., recover f_{x,ℓ} from the data stored in the remaining (n-1) servers.

Under any MDS code, a degraded read may be performed by reconstructing the entire file F_x from the
data stored in any k of the (n-1) remaining servers, and subsequently extracting the desired chunk f_{x,ℓ} from
it. An alternative means of performing degraded reads in MDS codes has been proposed recently in [12]-[14].
In particular, practical codes, termed the 'product-matrix (PM) codes', have been proposed in [14], that
can potentially provide significant speed-ups to the degraded read operation. These codes are associated to an
additional parameter d, and can recover f_{x,ℓ} of file F_x by reading and downloading a fraction 1/(d-k+1) of the
data stored in any d of the remaining (n-1) servers. For example, consider a system with parameters n = 6 and
k = 2, that stores files encoded by a PM code with parameter d = 3. Then, a complete file can be recovered, as in
any other MDS code, by downloading the corresponding chunks from any k = 2 servers. A part f_{x,ℓ} of a file may
also be recovered by reading and downloading data that is half the size of f_{x,ℓ} from each of any d = 3 servers.
This method of performing a degraded read is termed a 'repair' operation, since it was first proposed in the
context of repairing failed servers.

We employ the framework of the MDS queue developed in this paper to compare the two
aforementioned methods of performing degraded reads. We assume that each request desires to read chunk f_{x,ℓ}
for some (uniformly) random x ∈ {A,B,...} and ℓ ∈ {1,...,k}. We further assume the time required to read data
from any server to be exponentially distributed with a mean equal to the amount of data to be read, with each
chunk f_{x,ℓ} assumed to be of unit size. Under this setting, the method of reconstructing F_x from any k servers
forms an MDS(n-1,k) queue (with μ = 1), and the method of performing 'repair' forms an MDS(n-1,d) queue
(with μ = d-k+1).



In Fig. 20, we plot the average latency incurred under these two methods of performing degraded reads, for
the parameters n = 6, k = 2 and d = 3. One can see that the method of 'repair' performs consistently better as
compared to reconstructing the entire file F_x, thus corroborating the efficacy of these codes in terms of latency
performance for degraded reads.
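The queue parameters of the two degraded-read methods can be tabulated with a few lines of code. The sketch below is our own illustration; the service rate of the 'repair' method is taken as d - k + 1, which follows from reading a fraction 1/(d - k + 1) of a unit-size chunk with mean service time equal to the amount of data read. The dictionary keys are hypothetical names.

```python
def degraded_read_options(n, k, d):
    """Queue parameters of the two degraded-read methods, for unit-size chunks."""
    reconstruct = {"servers": n - 1, "jobs_per_request": k, "mu": 1.0}
    repair = {"servers": n - 1, "jobs_per_request": d, "mu": float(d - k + 1)}
    for option in (reconstruct, repair):
        # Expected total data read per request = number of jobs * mean job size.
        option["data_read"] = option["jobs_per_request"] / option["mu"]
    return reconstruct, repair


reconstruct, repair = degraded_read_options(n=6, k=2, d=3)
```

For n = 6, k = 2 and d = 3, 'repair' contacts more servers per request but each job is half as long, so the expected amount of data read per request is smaller than for full reconstruction.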



H. Request-flooding 



Request-flooding [9]-[11] is a method of (possibly) reducing the latency by sending a request to more than the
minimum requisite number of servers. Under an (n,k) MDS code, a request-flooding strategy sends (redundant)
requests to some p (k ≤ p ≤ n) servers. Upon completion of service by any k of these p servers, the batch is
considered to be served, and the services of the remaining (p-k) jobs of that batch are cancelled. The case of
p = k corresponds to the absence of flooding, which is the case considered throughout the rest of this
paper. We term p the request-flooding parameter. Request-flooding in a coded storage system was previously
analysed in [11] for the case p = n (in the absence of cancellation costs).

Fig. 19: Blocking probability when the buffer has a finite capacity. The arrival rate is λ = 1.5.

Fig. 20: Average latency during degraded reads. The parameters associated to this system are n = 6, k = 2.
The service time is exponentially distributed with a mean proportional to the amount of data to be read.
The parameter d associated to the product-matrix (PM) codes is set as d = 3.



In this section, via MCMC simulations, we analyse and compare the performance of request-flooding to p
servers, for various values of p. As discussed previously in Section II-A3, the cancellation of jobs would typically
incur a certain cost. In our simulations, we account for this in the following manner. Upon cancellation of a job at a
server, we require the server to not accept any further jobs (and remain idle) for a time duration that is distributed
exponentially with a rate 10μ (loosely speaking, this amounts to a 10% cancellation cost). In Fig. 21, we compare
the average batch latency incurred for various values of the flooding parameter p ∈ {k = 5, 6, 7, 8, 9, n = 10} under
such a flooding scheme. Observe that when the rate of arrivals λ is low, the average latency is minimized when
requests are flooded to all n = 10 servers. However, as the arrival rate λ increases, the benefits of flooding
diminish, and at very high arrival rates it seems prudent to flood only to a small number of servers, or to not
flood at all.



The intuition behind the plots of Fig. 21 is as follows. Firstly, observe that flooding a batch to a larger number
of servers ensures faster service for that batch. However, this also mandates additional resource utilization due
to flooding and cancellation. The performance of flooding schemes at different arrival rates depends on the
magnitude of the adverse effects of this additional utilization of resources. For instance, when the buffer is
empty, the additional resource utilization due to flooding and cancellation does not affect the performance, since
there is no other batch waiting for these resources. This intuitive argument explains the superior performance of
flooding with a high p at low arrival rates. On the other hand, when the buffer has several batches waiting in it,
the additional resource utilization by one batch prevents the waiting batches from utilizing these resources during
that time. This increase in the waiting times dominates the gains achieved by a faster service, and hence we
observe a superior performance of smaller values of p when the arrival rate is high. We note that in any system,
in general, such a crossover may occur, with the crossover point being dependent on the cost of cancellation.

Fig. 21: Average batch latency with request-flooding (MCMC simulations); p is the flooding parameter. Cancelling
a job at a server incurs a cancellation cost: the server cannot take up another job for a time duration that is
exponentially distributed with a rate 10μ.

VII. Discussion 

In this paper, we study storage systems that employ Maximum-Distance-Separable (MDS) codes through the
lens of queueing theory. The queues under such systems are termed the MDS queues, and an analysis of their
latency performance is provided through analytical means as well as via MCMC-based simulations. Our analysis
reveals the superior performance of codes compared to that of currently used replication-based schemes (up to
70%) under the models considered here. The key insight is that the property of being able to recover the data
from any k of the servers, which lends MDS codes a high storage efficiency and reliability, is also instrumental in
providing them a low latency. This paper thus makes a case for the use of codes to store frequently accessed
"hot" data.

We present two classes of scheduling policies, the MDS-Reservation(t) and the M^k/M/n(t) scheduling policies, 
that respectively lower and upper bound the performance of the MDS queue. We show that the queues resulting 
from these scheduling policies belong to the class of Quasi-Birth-Death (QBD) processes, and exploit the 
properties of QBD processes in our analysis. While both these classes coincide with the MDS queue when the 
parameter t = ∞, we show that even for small values of t (as small as t = 3), these bounds are quite tight and 
suffice to characterize the performance of the MDS queue closely. The MDS-Reservation(t) scheduling policy 
represents a practical scheme that requires the maintenance of a much smaller state, and may be of independent 
interest in practical systems. 

We also use the framework of the MDS queue developed in this paper to compare different methods of 
performing degraded reads. We observe that the average latency incurred for a degraded read can be reduced 
significantly by employing codes with efficient 'repair' properties, thereby providing queueing-theoretic evidence 
for the efficacy of these codes. We also study the efficacy of request flooding. We observe that when 
the arrival rates are low, flooding requests to a large number of servers yields a considerable improvement in the 
latency, while at high arrival rates it is better to flood to only a few servers or to not flood at all. 

In the future, we intend to build upon the framework presented in this paper, and analyse queues that relax 
one or more of the assumptions made in the paper: 

• General service times: more accurate modelling based on the underlying properties of the storage devices 
(e.g., (22)). 

• Heterogeneous requests: files of different sizes. 

• Heterogeneous servers: the n servers may not have identical service-time distributions. 

• Absence of MDS property: a storage system may alternatively employ codes that are not MDS (e.g., EJ), 
or store different files in different sets of n servers. In such situations, instead of being able to recover the 
data from any set of k servers, there will exist pre-specified subsets of the servers from which a file can be 
recovered. 

• Decentralized MDS queue: each server may have its own buffer, in which case, the k jobs of a batch must 
be sent to the buffers of k distinct servers, and the choice of these k servers may have to be made much 
before the jobs are actually serviced. 

Finally, it is also of utmost interest to connect the framework and results presented in this paper to traces from 
real-world data centers, which we plan to do in the near future. 
Appendix 
Proofs of Theorems 

Proof of Theorem 1: Since the scheduling policy mandates all $k$ jobs of any batch to start service together, 
the number of jobs in the buffer is necessarily a multiple of $k$. Furthermore, when the buffer is not empty, the 
number of servers that are idle must be strictly smaller than $k$ (since otherwise, the first waiting batch can be 
served). It follows that when $m \le n$, the buffer is empty ($b = 0$), and all $m$ jobs are being served by $m$ 
servers (and $z = n - m$ servers are idle). When $m > n$, the buffer is not empty. Assuming there are $z$ idle 
servers, there must be $(n - z)$ jobs currently being served, and hence there are $m - (n - z)$ jobs in the buffer. 
However, since the number of jobs in the buffer must be a multiple of $k$, and since $z \in \{0, 1, \ldots, k-1\}$, 
it must be that $z = (n - m) \bmod k$. Thus, when $m > n$, there are $b = \frac{m - n + z}{k}$ batches waiting in 
the buffer, and $w_i = k$, $s_i = 0$ $\forall\, i \in \{1, \ldots, b\}$. We have thus shown that the knowledge of 
$m$ suffices to completely describe the system. 

Once we have determined the configuration of the system as above, it is easy to obtain the transitions 
between the states. An arrival of a batch increases the total number of jobs in the system by $k$, and hence 
there is a transition from state $m$ to $(m+k)$ at rate $\lambda$. When $m \le n$, all $m$ jobs are being served, 
and the buffer is empty. Thus, the total number of jobs in the system reduces to $(m-1)$ at rate $m\mu$. When 
$m > n$, the number of jobs being served is $n - z = n - ((n - m) \bmod k)$, and thus there is a transition from 
state $m$ to $(m-1)$ at rate $(n - ((n - m) \bmod k))\mu$. ■ 
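
The configuration and transition rates described in this proof can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `reservation0_state` and `reservation0_rates` are hypothetical, and the boundary case is taken to be m ≤ n, as the argument above requires.

```python
def reservation0_state(m, n, k):
    """Return (z, b): idle servers and waiting batches for m jobs in the system.

    Assumes the MDS-Reservation(0) configuration from the proof:
    buffer empty when m <= n, otherwise z = (n - m) mod k.
    """
    if m <= n:
        return n - m, 0
    z = (n - m) % k            # Python's % is non-negative for k > 0
    b = (m - n + z) // k       # jobs in the buffer are a multiple of k
    return z, b


def reservation0_rates(m, n, k, lam, mu):
    """Return {next_state: rate} for the one-dimensional chain at state m."""
    z, _ = reservation0_state(m, n, k)
    rates = {m + k: lam}       # a batch arrival adds k jobs
    busy = n - z               # equals m when m <= n
    if busy > 0:               # no departures from the empty system
        rates[m - 1] = busy * mu
    return rates


if __name__ == "__main__":
    # n = 4 servers, k = 2 jobs per batch, 5 jobs in the system
    print(reservation0_state(5, 4, 2))
    print(reservation0_rates(5, 4, 2, lam=1.0, mu=1.0))
```

With n = 4 and k = 2, state m = 5 has one idle server and one waiting batch, so only three servers contribute to the departure rate.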

Proof of Theorem 2: The result follows as a special case of Theorem 3. As a side note, in any given state 
$(w_1, m) \in \{0, 1, \ldots, k\} \times \{0, 1, \ldots, \infty\}$ of the resulting Markov chain, the number of idle 
servers is given by $z = n - m$ if $m \le n - k$, and $z = (n + w_1 - m) \bmod k$ otherwise. The state $(w_1, m)$ 
has transitions to state: 

• $((m + k - n)^+,\, m + k)$ at rate $\lambda$, if $w_1 = 0$. 

• $(w_1,\, m + k)$ at rate $\lambda$, if $w_1 \neq 0$. 

• $(w_1,\, m - 1)$ at rate $m\mu$, if $w_1 = 0$. 

• $(w_1,\, m - 1)$ at rate $(k - w_1 - z)\mu$, if $w_1 \neq 0$. 

• $(w_1 - 1,\, m - 1)$ at rate $(n - k + w_1)\mu$, if ($w_1 > 1$ or ($w_1 = 1$ and $m \le n + 1$)). 

• $((k - z)^+,\, m - 1)$ at rate $(n - k + w_1)\mu$, if ($w_1 = 1$ and $m > n + 1$). 
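
The six transition rules above translate directly into a small function. The sketch below is illustrative only: the name `reservation1_transitions` is hypothetical, the boundary conditions are read as m ≤ n − k and m ≤ n + 1, and no check is made that the input state is actually reachable.

```python
def reservation1_transitions(w1, m, n, k, lam, mu):
    """Return {(w1', m'): rate} for MDS-Reservation(1) state (w1, m).

    Follows the transition list in the proof of Theorem 2. Zero-rate
    transitions are dropped. Reachability of (w1, m) is not validated.
    """
    # Number of idle servers in state (w1, m).
    z = n - m if m <= n - k else (n + w1 - m) % k

    out = {}

    def add(state, rate):
        if rate > 0:
            out[state] = out.get(state, 0) + rate

    if w1 == 0:
        # Buffer empty: an arriving batch leaves (m + k - n)^+ jobs waiting.
        add((max(m + k - n, 0), m + k), lam)
        add((0, m - 1), m * mu)
    else:
        add((w1, m + k), lam)
        # A server holding a job of the first waiting batch finishes and idles.
        add((w1, m - 1), (k - w1 - z) * mu)
        if w1 > 1 or m <= n + 1:
            # Another server finishes and takes a job of the first waiting batch.
            add((w1 - 1, m - 1), (n - k + w1) * mu)
        else:
            # The first batch leaves the buffer; the z idle servers start
            # serving the next batch, which keeps (k - z)^+ jobs waiting.
            add((max(k - z, 0), m - 1), (n - k + w1) * mu)
    return out


if __name__ == "__main__":
    for state in [(0, 4), (1, 5), (1, 7), (2, 6)]:
        print(state, reservation1_transitions(*state, n=4, k=2, lam=1, mu=1))
```

For n = 4 and k = 2, state (1, 7) moves to (2, 6) at rate 3μ: the last buffered job of the first batch enters service and the next batch, with both jobs still waiting, becomes the first waiting batch.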
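Proof of Theorem 3: For any state of the system $(w_1, w_2, \ldots, w_t, m) \in \{0, 1, \ldots, k\}^t \times \{0, 1, \ldots, \infty\}$, define 

\[
q = \begin{cases}
0 & \text{if } w_1 = 0 \\
t & \text{else if } w_t \neq 0 \\
\arg\max\{t' : w_{t'} \neq 0,\ 1 \le t' \le t\} & \text{otherwise.}
\end{cases}
\tag{8}
\]

It can be shown that 

\[
b = \begin{cases}
0 & \text{if } q = 0 \\
q & \text{if } 0 < q < t \\
t + \left\lceil \dfrac{m - \sum_{j=1}^{t} w_j - n}{k} \right\rceil & \text{otherwise,}
\end{cases}
\tag{9}
\]

\[
z = n - \Big( m - \sum_{j=1}^{t} w_j - (b - t)^+ k \Big),
\tag{10}
\]

\[
s_i = \begin{cases}
w_{i+1} - w_i & \text{if } i \in \{1, \ldots, q-1\} \\
k - z - w_q & \text{if } i = q \\
0 & \text{if } i \in \{q+1, \ldots, b\},
\end{cases}
\tag{11}
\]

and 

\[
\text{for } i \in \{t+1, \ldots, b\}, \quad w_i = k.
\tag{12}
\]

Given the complete description of the state of the system as above, the characterization of the transition diagram 
is a straightforward task. 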
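
To make the state description concrete, the following sketch evaluates q, b, z and the s_i for a given state. It is a hedged illustration rather than the authors' code: the name `describe_state` is hypothetical, and the ceiling expression used for b when q = t, namely t + ⌈(m − Σ w_j − n)/k⌉, is an assumption chosen to be consistent with the expression for z in (10).

```python
def describe_state(w, m, n, k):
    """Return (q, b, z, s) for an MDS-Reservation(t) state (w_1..w_t, m).

    w is a tuple (w_1, ..., w_t); s is the list [s_1, ..., s_b].
    The ceiling formula for b in the case q = t is an assumption.
    """
    t = len(w)
    total_w = sum(w)

    # q: index of the last of the first t waiting batches that has
    # jobs in the buffer (0 if the buffer is empty).
    if w[0] == 0:
        q = 0
    elif w[-1] != 0:
        q = t
    else:
        q = max(i for i in range(1, t + 1) if w[i - 1] != 0)

    # b: number of waiting batches.
    if q == 0:
        b = 0
    elif q < t:
        b = q
    else:
        b = t + -(-(m - total_w - n) // k)   # integer ceiling division

    # z: idle servers = n minus the jobs currently in service.
    z = n - (m - total_w - max(b - t, 0) * k)

    # s_i: jobs of the i-th waiting batch that are in service.
    s = []
    for i in range(1, b + 1):
        if i < q:
            s.append(w[i] - w[i - 1])        # w_{i+1} - w_i
        elif i == q:
            s.append(k - z - w[q - 1])
        else:
            s.append(0)
    return q, b, z, s


if __name__ == "__main__":
    # n = 4, k = 2, t = 2: batch 1 has one job waiting, batch 2 has both.
    print(describe_state((1, 2), 7, 4, 2))
```

For t = 1 the results agree with the idle-server expression z = (n + w_1 − m) mod k given for MDS-Reservation(1).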

It is not difficult to see that the MDS-Reservation(t) queue has the following two key features: (a) any transition 
event changes the value of $m$ by at most $k$, and (b) for $m \ge n - k + 1 + tk$, the transition from any state 
$(w_1, m)$ to any other state $(w_1', m' \ge n - k + 1 + tk)$ depends on $m \bmod k$ and not on the actual value 
of $m$. This results in a QBD process with boundary states and levels as specified in the statement of the 
theorem. Intuitively, this says that when $m \ge n - k + 1 + tk$, the presence of an additional batch at the end of 
the buffer has no effect on the functioning of the system. (In contrast, when $m < n - k + 1 + tk$, the system may 
behave differently if there were an additional batch, due to the possibility of this batch being within the 
threshold $t$. For instance, a job of this additional batch may be served upon completion of service at a server, 
which is not possible if this additional batch were not present.) 



Proof of Theorem 4: The MDS-Reservation(t) scheduling policy treats the first $t$ waiting batches in the 
buffer as per the MDS scheduling policy, while imposing an additional restriction on batches $(t + 1)$ onwards. 
When $t = \infty$, every batch is treated as in MDS, thus making MDS-Reservation($\infty$) identical to MDS. ■ 

Proof of Theorem 5: Under $M^k/M/n(0)$, any job can be processed by any server, and hence a server may 
be idle only when the buffer is empty. Thus, when $m \le n$, all $m$ jobs are in the servers, and the buffer is empty. 
When $m > n$, all the $n$ servers are full and the remaining $(m - n)$ jobs are in the buffer. The transitions follow as 
a direct consequence of these observations. It also follows that when $m \le n$, $b = 0$ and $z = n - m$. In addition, in 
state $m$ $(> n)$ it must be that $w_1 = (m - n) \bmod k$ (taken to be $k$ when $k$ divides $m - n$), 
$b = \lceil \frac{m - n}{k} \rceil$, and for $i \in \{2, \ldots, b\}$, $w_i = k$. Thus the knowledge 
of $m$ suffices to describe the configuration of the entire system. ■ 
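
The mapping from m to the full configuration under $M^k/M/n(0)$ can be sketched as follows. The function name `mkmn0_state` is hypothetical, and the sketch assumes that $w_1$ is whatever remains in the buffer after the $b - 1$ full batches behind it, so that $w_1 = k$ when $k$ divides $m - n$.

```python
def mkmn0_state(m, n, k):
    """Return the configuration (z, b, w) implied by m under M^k/M/n(0).

    z: idle servers, b: waiting batches, w: list [w_1, ..., w_b] of the
    number of buffered jobs per waiting batch.
    """
    if m <= n:
        # Every job is in service and the buffer is empty.
        return {"z": n - m, "b": 0, "w": []}
    waiting = m - n                      # all n servers are busy
    b = -(-waiting // k)                 # ceil((m - n) / k)
    w1 = waiting - (b - 1) * k           # in {1, ..., k}
    return {"z": 0, "b": b, "w": [w1] + [k] * (b - 1)}


if __name__ == "__main__":
    for m in (3, 7, 8):
        print(m, mkmn0_state(m, n=4, k=2))
```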

Proof of Theorem 6: For any state $(w_1, w_2, \ldots, w_t, m)$, define $q$ as in (8). The values of $b$, $z$, $w_i$ 
are identical to those in the proof of Theorem 3. Given the complete description of the state of the system as 
above, the characterization of the transition diagram is a straightforward task. It is not difficult to see that the 
$M^k/M/n(t)$ queues have the following two key features: (a) any transition event changes the value of $m$ by at 
most $k$, and (b) for $m \ge n + 1 + tk$, the transition from any state $(w_1, m)$ to any other state 
$(w_1', m' \ge n + 1 + tk)$ depends on $m \bmod k$ and not on the actual value of $m$. This results in a QBD 
process with boundary states and levels as specified in the statement of the theorem. Intuitively, this says that 
when $m \ge n + 1 + tk$, the total number of waiting batches is strictly greater than $t$. In this situation, the 
presence of an additional batch at the end of the buffer has no effect on the functioning of the system. ■ 

Proof of Theorem 7: The $M^k/M/n(t)$ scheduling policy follows the MDS scheduling policy when the number 
of batches in the buffer is less than or equal to $t$. Thus, $M^k/M/n(\infty)$ is always identical to MDS. ■ 

Proof of Theorem 8: In the MDS queue, suppose there are a large number of batches waiting in the buffer. 
Then, whenever a server completes a service, one can always find a waiting batch that has not been served 
by that server. Thus, no server is ever idle. Since the system has $n$ servers, each serving jobs with times i.i.d. 
exponential with rate $\mu$, the average number of jobs exiting the system per unit time is $n\mu$. The above 
argument also implies that under no circumstances (under any scheduling policy) can the average number of jobs 
exiting the system per unit time exceed $n\mu$. Finally, since each batch consists of $k$ jobs, the rate at which 
batches exit the system is $\lambda^*_{\mathrm{MDS}} = \frac{n\mu}{k}$ per unit time. Since the $M^k/M/n(t)$ queues 
upper bound the performance of the MDS queue, $\lambda^*_{M^k/M/n(t)} = \frac{n\mu}{k}$ for every $t$. 

We shall now evaluate the maximum throughput of MDS-Reservation(1) by exploiting properties of QBD 
systems. In general, the maximum throughput $\lambda^*$ of any QBD system is the value of $\lambda$ such that: 
$\exists\, v$ satisfying $v^T (A_0 + A_1 + A_2) = 0$ and $v^T A_0 \mathbf{1} = v^T A_2 \mathbf{1}$, where 
$\mathbf{1} = [1\ 1\ \cdots\ 1]^T$. Note that the matrices $A_0$, $A_1$ and $A_2$ are affine transformations of 
$\lambda$ (for fixed values of $\mu$ and $k$). Using the values of $A_0$, $A_1$, $A_2$ in the QBD representation 
of MDS-Reservation(1), we can show that $\lambda^*_{\mathrm{Resv}(1)} \ge (1 - O(n^{-2}))\frac{n}{k}\mu$. For 
$t \ge 2$, each of the MDS-Reservation(t) queues upper bounds MDS-Reservation(1), and is itself upper bounded by 
the MDS queue. It follows that 
$\frac{n}{k}\mu \ge \lambda^*_{\mathrm{Resv}(t)} \ge (1 - O(n^{-2}))\frac{n}{k}\mu$ for $t \ge 1$. 

The value of $\lambda^*_{\mathrm{Resv}(t)}$ can be explicitly computed for any value of $n$, $k$ and $t$ via the 
method described above. We perform this computation for $k = 2$ and $k = 3$ when $t = 1$ to obtain the result 
mentioned in the statement of the theorem. We show the computation for $k = 2$ here. 

When $k = 2$ and $t = 1$, the $j$th level of the QBD process consists of states 
$\{0, 1, 2\} \times \{n - 1 + 2j,\ n + 2j\}$ for $j \ge 1$. However, as seen in Fig. 7, several of these states never 
occur. In particular, in level $j$, only the states $(1, n - 1 + 2j)$, $(1, n + 2j)$ and $(2, n + 2j)$ may be visited. 
Thus, to simplify notation, in the following discussion we consider the QBD process assuming the existence of 
only these three states (in that order) in every level. 
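Under this representation, we have 

\[
A_0 = \begin{bmatrix} 0 & \mu & (n-1)\mu \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad
A_1 = \begin{bmatrix} -(n\mu + \lambda) & 0 & 0 \\ (n-1)\mu & -((n-1)\mu + \lambda) & 0 \\ n\mu & 0 & -(n\mu + \lambda) \end{bmatrix}, \quad
A_2 = \begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{bmatrix}.
\]

One can verify that the vector 

\[
v = \begin{bmatrix} n - 1 & 1 & \dfrac{(n-1)^2}{n} \end{bmatrix}^T
\]

satisfies 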

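\[
v^T (A_0 + A_1 + A_2) = 0. \tag{14}
\]

Thus, 

\[
v^T A_0 \mathbf{1} = n(n-1)\mu, \tag{15}
\]

and 

\[
v^T A_2 \mathbf{1} = \lambda \left( n + \frac{(n-1)^2}{n} \right). \tag{16}
\]

According to the properties of QBD processes, the value of $\lambda = \lambda^*_{\mathrm{Resv}(1)}$ must satisfy 
$v^T A_0 \mathbf{1} = v^T A_2 \mathbf{1}$. Thus, 

\[
\lambda^*_{\mathrm{Resv}(1)} = \frac{n^2 (n-1)}{2n^2 - 2n + 1}\,\mu = \left( 1 - \frac{1}{2n^2 - 2n + 1} \right) \frac{n}{2}\,\mu .
\]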
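
As a sanity check on the k = 2, t = 1 computation, the sketch below builds level matrices for the three states (1, n−1+2j), (1, n+2j), (2, n+2j) and verifies the closed form with exact rational arithmetic. The matrix entries are an assumption: they are derived here from the transition rules in the proof of Theorem 2, with A_0 taken as the one-level-down block and A_2 as the one-level-up block (the convention suggested by (15) and (16)). The function names are illustrative.

```python
from fractions import Fraction


def resv1_k2_level_matrices(n, lam, mu):
    """Assumed QBD blocks for MDS-Reservation(1) with k = 2.

    State order within a level: (1, n-1+2j), (1, n+2j), (2, n+2j).
    A0: transitions one level down, A1: within the level, A2: one level up.
    """
    A0 = [[0, mu, (n - 1) * mu],
          [0, 0, 0],
          [0, 0, 0]]
    A1 = [[-(lam + n * mu), 0, 0],
          [(n - 1) * mu, -(lam + (n - 1) * mu), 0],
          [n * mu, 0, -(lam + n * mu)]]
    A2 = [[lam, 0, 0],
          [0, lam, 0],
          [0, 0, lam]]
    return A0, A1, A2


def max_throughput_resv1_k2(n, mu):
    """Solve v^T A0 1 = v^T A2 1 for lambda, using exact fractions."""
    n, mu = Fraction(n), Fraction(mu)
    v = [n - 1, Fraction(1), (n - 1) ** 2 / n]

    # v^T (A0 + A1 + A2) = 0 holds for any lambda; check it with lambda = 1.
    A0, A1, A2 = resv1_k2_level_matrices(n, Fraction(1), mu)
    for c in range(3):
        col = sum(v[r] * (A0[r][c] + A1[r][c] + A2[r][c]) for r in range(3))
        assert col == 0

    down = sum(v[r] * sum(A0[r]) for r in range(3))   # v^T A0 1
    up_per_lambda = sum(v)                            # (v^T A2 1) / lambda
    return down / up_per_lambda


if __name__ == "__main__":
    for n in (2, 4, 10):
        closed_form = (1 - Fraction(1, 2 * n * n - 2 * n + 1)) * Fraction(n, 2)
        print(n, max_throughput_resv1_k2(n, 1), closed_form)
```

Under these assumptions the balance condition reproduces the closed form for every n tried, and the gap to the upper bound nμ/2 shrinks as 1/(2n² − 2n + 1), in line with the (1 − O(n⁻²)) statement.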



